Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation
Abstract
The word is the basic unit in natural language processing (NLP), since it is the lexical level on which further processing rests. The lack of word delimiters such as spaces in Chinese text makes Chinese word segmentation (CWS) an interesting yet challenging problem. This paper describes the in-depth research that followed our participation in the fourth International Chinese Language Processing Bakeoff (Bakeoff-4). Originally, we incorporate unsupervised segmentation into Conditional Random Fields (CRFs) for the purpose of dealing with unknown words. Normalization is applied to address the problem of small data size. Experiments on the CWS corpora from Bakeoff-4 yield results comparable to state-of-the-art performance.
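To make the unsupervised signal named in the title concrete, the following is a minimal sketch of accessor variety (AV) as it is commonly defined: the smaller of the number of distinct characters seen immediately before a string and the number seen immediately after it. The function name `accessor_variety`, the `max_len` parameter, and the boundary markers `^` and `$` are assumptions made for illustration. The abstract does not specify the normalization or how the values enter the CRF, so neither is shown here.

```python
from collections import defaultdict


def accessor_variety(corpus, max_len=4):
    """Return {substring: AV} for all substrings of up to max_len characters.

    AV(s) = min(number of distinct left neighbours, number of distinct
    right neighbours). Sentence boundaries are counted as a single
    accessor type ("^" or "$"), which is a simplification.
    """
    left = defaultdict(set)   # substring -> characters seen just before it
    right = defaultdict(set)  # substring -> characters seen just after it
    for sent in corpus:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "^")
                right[s].add(sent[j] if j < n else "$")
    return {s: min(len(left[s]), len(right[s])) for s in left}


if __name__ == "__main__":
    # Toy unlabeled corpus (hypothetical sentences, for illustration only).
    corpus = ["门把手坏了", "请把手拿开", "这把手枪"]
    av = accessor_variety(corpus)
    print(av["把手"], av["手"])
```

A string that appears in many different contexts, such as 把手 in the toy corpus, receives a higher AV than a string that always follows the same character, such as 手, which is why AV is used as evidence for word boundaries.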
Similar resources
Unsupervised Overlapping Feature Selection for Conditional Random Fields Learning in Chinese Word Segmentation
Wen-lian Hsu, Institute of Information Science, Academia Sinica. This work presents several unsupervised feature selections based on frequent strings that help improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TC...
English-to-Chinese Machine Transliteration using Accessor Variety Features of Source Graphemes
This work presents a grapheme-based approach to English-to-Chinese (E2C) transliteration, which consists of many-to-many (M2M) alignment and conditional random fields (CRF) using accessor variety (AV) as an additional feature to approximate the local context of source graphemes. Experimental results show that the AV of a given English named entity generally improves the effectiveness of E2C transliteration.
Enhancing LSTM-based Word Segmentation Using Unlabeled Data
Word segmentation is widely treated as a sequence labeling problem. The traditional approach to this kind of problem is a machine learning method such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and LSTM networks are a popular choice among them. This paper gives a method to introduce nu...
Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-con...
Cost-benefit Analysis of Two-Stage Conditional Random Fields based English-to-Chinese Machine Transliteration
This work presents an English-to-Chinese (E2C) machine transliteration system based on two-stage conditional random fields (CRF) models with accessor variety (AV) as an additional feature to approximate the local context of the source language. Experimental results show that the two-stage CRF method outperforms its one-stage counterpart since the former costs less to encode more features and finer grained l...
Publication date: 2009